Abstract: In agglomerative hierarchical clustering, pair-group methods suffer 
from a problem of non-uniqueness when two or more distances between different 
clusters coincide during the amalgamation process. The traditional approach for 
solving this drawback has been to take any arbitrary criterion in order to break ties 
between distances, which results in different hierarchical classifications depending 
on the criterion followed. In this article we propose a variable-group algorithm 
that groups more than two clusters at the same time when ties 
occur. We give a tree representation for the results of the algorithm, which we call 
a multidendrogram, as well as a generalization of the Lance and Williams' formula 
which enables the implementation of the algorithm in a recursive way. 

Keywords: Agglomerative methods; Cluster analysis; Hierarchical classification; Lance and 
Williams' formula; Ties in proximity. 



1 Introduction 

Clustering methods group individuals into groups of individuals or clusters, so that individuals 
in a cluster are close to one another. In agglomerative hierarchical clustering (Cormack 1971; 
Sneath and Sokal 1973, sec. 5.5; Gordon 1999, chap. 4), one begins with a proximity matrix 
between individuals, each one forming a singleton cluster. Then, clusters are themselves 
grouped into groups of clusters or superclusters, the process being repeated until a complete 
hierarchy is formed. Among the different types of agglomerative methods we find single 
linkage, complete linkage, unweighted average, weighted average, etc., which differ in the 
definition of the proximity measure between clusters. 

Except for the single linkage case, all the other clustering techniques suffer from a non- 
uniqueness problem, sometimes called the ties in proximity problem, which is caused by ties 
either occurring in the initial proximity data or arising during the amalgamation process. 
Within the family of agglomerative hierarchical methods, complete linkage is more prone 
than the others to encountering ties during the clustering process, since it does not generate 
new proximity values different from the initial ones. Ties in the original data are more 
frequent when one works with binary variables, or with integer variables taking only a few 
distinct values. But they can also appear with continuous variables, especially if the 
precision of the experimental data is low. Conversely, the absence of ties is sometimes an 
artifact of representing the data with more decimal digits than their precision warrants. 
The non-uniqueness problem also depends on the measure used 
to obtain the proximity values from the initial variables. Moreover, in general, the larger the 
data set, the more ties arise (MacCuish, Nicolaou and MacCuish 2001). 

The ties in proximity problem is well-known from several studies in different fields, for 
example in biology (Hart 1983; Backeljau, De Bruyn, De Wolf, Jordaens, Van Dongen and 
Winnepenninckx 1996; Arnau, Mars and Marin 2005), in psychology (Van der Kloot, Spaans 
and Heiser 2005), or in chemistry (MacCuish et al. 2001). Nevertheless, this problem is 
frequently ignored in software packages (Morgan and Ray 1995; Backeljau et al. 1996; Van 
der Kloot et al. 2005), and those packages which do not ignore it fail to adopt a common 
standard with respect to ties. Many of them simply break the ties in any arbitrary way, thus 
producing a single hierarchy. In some cases the analysis is repeated a given number of times 
with randomized input data order, and then additional criteria can be used for selecting one 
of the possible solutions (Arnau et al. 2005). In other cases, some requirements are given 
on the number of individuals and the number of characteristics needed to generate proximity 
data without ties (Hart 1983; MacCuish et al. 2001). None of these proposals can ensure the 
complete absence of ties, nor can all their requirements always be satisfied. 

Another possibility for dealing with multiple solutions is to use further criteria, like a 
distortion measure (Cormack 1971, table 3), and select the best solution among all the possible 
ones. However, the result of this approach will depend on the distortion measure used, 
which means that an additional choice must be made. Nor does this proposal ensure the 
uniqueness of the solution, since several candidate solutions might share the same minimum 
distortion value. Besides, in ill-conditioned problems (those susceptible to the occurrence of 
too many ties), an exhaustive search over all possible hierarchical classifications is not 
feasible because of its high computational cost. In this regard, Van der Kloot et al. 
(2005) analyze two data sets using many random permutations of the input data order, and 
with additional criteria they evaluate the quality of each solution. They show that the best 
solutions frequently emerge after many permutations, and they also notice that the goodness 
of these solutions necessarily depends on the number of permutations used. 

An alternative proposal is to seek a hierarchical classification which describes common 
structure among all the possible solutions, as recommended by Hart (1983). One approach is 
to prune as little as possible from the classifications being compared to arrive at a common 
structure such as the maximal common pruned tree (Morgan and Ray 1995). Care must be 
taken not to prune too much, so this approach can be followed only when the number of 
alternative solutions is small and they are all known. Furthermore, the maximal common 
pruned tree need not be uniquely defined and it does not give a complete classification for all 
the individuals under study. 

What we propose in this article is an agglomerative hierarchical clustering algorithm that 
solves the ties in proximity problem by merging into the same supercluster all the clusters 
that fall into a tie. In order to do so we must be able to calculate the distance separating 
any two superclusters, hence we have generalized the definition of distance between clusters 
to the superclusters case, for the most commonly used agglomerative hierarchical clustering 
techniques. Additionally, we give the corresponding generalization of Lance and Williams' 
formula, which enables us to compute these distances in a recursive way. Moreover, we 
introduce a new tree representation for the results obtained with the agglomerative algorithm: 
the multidendrogram. 

In Section 2 we introduce our proposal of clustering algorithm and the multidendrogram 
representation for the results. Section 3 gives the corresponding generalization of some hier- 
archical clustering strategies. In Section 4, Lance and Williams' formula is also generalized 
consistently with the new proposal. Section 5 shows some results corresponding to data from 
a real example, and we finish with some conclusions in Section 6. 

2 Agglomerative Hierarchical Algorithm 
2.1 Pair-Group Approach 

Agglomerative hierarchical procedures build a hierarchical classification in a bottom-up way, 
from a proximity matrix containing dissimilarity data between individuals of a set 
$\Omega = \{x_1, x_2, \ldots, x_n\}$ (the same analysis could be done using similarity data). 
The algorithm has the following steps: 

0) Initialize $n$ singleton clusters with one individual in each of them: $\{x_1\}, \{x_2\}, 
\ldots, \{x_n\}$. Initialize also the distances between clusters, $D(\{x_i\}, \{x_j\})$, with 
the values of the distances between individuals, $d(x_i, x_j)$: 

$$D(\{x_i\}, \{x_j\}) = d(x_i, x_j) \quad \forall i, j = 1, 2, \ldots, n.$$

1) Find the shortest distance separating two different clusters. 

2) Select two clusters $X_i$ and $X_{i'}$ separated by such shortest distance and merge them 
into a new supercluster $X_i \cup X_{i'}$. 

3) Compute the distances $D(X_i \cup X_{i'}, X_j)$ between the new supercluster 
$X_i \cup X_{i'}$ and each of the other clusters $X_j$. 

4) If all individuals are not in a single cluster yet, then go back to step 1. 

Following Sneath and Sokal (1973, p. 216), this type of approach is known as a pair- 
group method, in opposition to variable-group methods which will be discussed in the next 
subsection. Depending on the criterion used for the calculation of distances in step 3, we 
can implement different agglomerative hierarchical methods. In this article we study some 
of the most commonly used ones, which are: single linkage, complete linkage, unweighted 
average, weighted average, unweighted centroid, weighted centroid and joint between-within. 
The problem of non-uniqueness may arise at step 2 of the algorithm, when two or more pairs 
of clusters are separated by the shortest distance value (i.e., the shortest distance is tied). 
Every choice for breaking ties may have important consequences, because it changes the col- 
lection of clusters and the distances between them, possibly resulting in different hierarchical 
classifications. It must be noted here that not all tied distances will produce ambiguity: they 
have to be the shortest ones and they also have to involve a common cluster. On the other 
hand, ambiguity is not limited to cases with ties in the original proximity values, but ties may 
arise during the clustering process too. 

The use of any hierarchical clustering technique on a finite set $\Omega$ with $n$ individuals 
results in an $n$-tree on $\Omega$, which is defined as a subset $T$ of parts of $\Omega$ 
satisfying the following conditions: 

(i) $\Omega \in T$, 

(ii) $\emptyset \notin T$, 

(iii) $\forall x \in \Omega$: $\{x\} \in T$, 

(iv) $\forall X, Y \in T$: $(X \cap Y = \emptyset \,\lor\, X \subseteq Y \,\lor\, Y \subseteq X)$. 

An $n$-tree gives only the hierarchical structure of a classification, but the use of a 
hierarchical clustering technique also associates a height $h$ with each of the clusters 
obtained. All this information is gathered in the definition of a valued tree on $\Omega$, 
which is a pair $(T, h)$ where $T$ is an $n$-tree on $\Omega$ and $h: T \to \mathbb{R}$ is a 
function such that $\forall X, Y \in T$: 

(i) $h(X) \geq 0$, 

(ii) $h(X) = 0 \iff |X| = 1$, 

(iii) $X \subsetneq Y \implies h(X) < h(Y)$, 

where $|X|$ denotes the cardinality of $X$. 

For example, suppose that we have a graph with four individuals like that of Figure 1, 
where the initial distance between any two individuals is the length of the shortest path 
connecting them. This means, for example, that the initial distance between $x_2$ and $x_4$ 
is equal to 5. Using the unweighted average criterion, we can obtain three different valued 
trees. Valued trees are represented graphically by so-called dendrograms, and Figure 2 shows 
the three corresponding dendrograms obtained for our toy graph. The first two dendrograms are 
quite similar, but the third one shows a considerably different hierarchical structure. Hence, 
if the third dendrogram is the only one obtained by a software package, one could draw 
from it the wrong conclusion that $x_3$ is closer to $x_4$ than it is to $x_2$. 
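
The three valued trees can be reproduced with a short exhaustive search. The sketch below is only an illustration, not the authors' code: it takes the six pairwise distances quoted for the toy graph in Section 2.2 (2, 2, 4, 3, 5 and 7), applies the pair-group algorithm with the unweighted average criterion, and follows every possible tie-breaking choice. The function names `upgma` and `all_pair_group_trees` are hypothetical, and exact fractions are used so that ties are detected without rounding issues.

```python
from fractions import Fraction
from itertools import combinations

# Pairwise distances of the toy example, keyed by the pair of individual ids.
D0 = {
    frozenset({1, 2}): 2, frozenset({2, 3}): 2, frozenset({1, 3}): 4,
    frozenset({3, 4}): 3, frozenset({2, 4}): 5, frozenset({1, 4}): 7,
}

def upgma(a, b):
    """Unweighted average distance between clusters a and b (frozensets of ids)."""
    total = sum(D0[frozenset({x, y})] for x in a for y in b)
    return Fraction(total, len(a) * len(b))

def all_pair_group_trees(clusters, merges=()):
    """Follow every tie-breaking choice; yield each merge history as a frozenset."""
    if len(clusters) == 1:
        yield frozenset(merges)
        return
    dist = {(a, b): upgma(a, b) for a, b in combinations(clusters, 2)}
    shortest = min(dist.values())
    for (a, b), d in dist.items():
        if d == shortest:                      # one branch per tied pair
            rest = [c for c in clusters if c not in (a, b)]
            yield from all_pair_group_trees(rest + [a | b], merges + ((a | b, d),))

singletons = [frozenset({i}) for i in (1, 2, 3, 4)]
trees = set(all_pair_group_trees(singletons))
print(len(trees))                              # number of distinct valued trees
for tree in trees:
    print(sorted((sorted(c), float(h)) for c, h in tree))
```

Each printed tree is a list of (cluster, height) pairs. The search is exponential in the number of ties, which is one reason why enumerating all solutions does not scale.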

2.2 Variable-Group Proposal: Multidendrograms 

Any decision taken to break ties in the toy graph of Figure 1 would be arbitrary. In fact, 
the use of an unfortunate rule might lead us to the worst dendrogram of the three. A logical 
solution to the pair-group criterion problem might be to assign the same importance to all 
tied distances and, therefore, to use a variable-group criterion. In our example of Figure 1 
this means the amalgamation of individuals $x_1$, $x_2$ and $x_3$ in a single cluster at the 
same time. The immediate consequence is that we have to calculate the distance between the 
new cluster $\{x_1\} \cup \{x_2\} \cup \{x_3\}$ and the cluster $\{x_4\}$. In the unweighted 
average case this distance is equal to 5, that is, the arithmetic mean of the values 7, 5 and 
3, corresponding respectively to the distances $D(\{x_1\}, \{x_4\})$, $D(\{x_2\}, \{x_4\})$ 
and $D(\{x_3\}, \{x_4\})$. We must also decide what height should be assigned to the new 
cluster formed by $x_1$, $x_2$ and $x_3$, which could be any value between the minimum and 
the maximum distances that separate any two of them. In this case the minimum distance is 2 
and corresponds to both of the tied distances $D(\{x_1\}, \{x_2\})$ and 
$D(\{x_2\}, \{x_3\})$, while the maximum distance is the one separating $x_1$ from $x_3$ and 
it is equal to 4. 

Following the variable-group criterion on a finite set $\Omega$ with $n$ individuals, we no 
longer get several valued trees, but we obtain a unique tree which we call a multivalued tree 
on $\Omega$, and we define it as a triplet $(T, h_l, h_u)$ where $T$ is an $n$-tree on 
$\Omega$ and $h_l, h_u: T \to \mathbb{R}$ are two functions such that $\forall X, Y \in T$: 

(i) $0 \leq h_l(X) \leq h_u(X)$, 

(ii) $h_l(X) = 0 \iff h_u(X) = 0 \iff |X| = 1$, 

(iii) $X \subsetneq Y \implies h_l(X) < h_l(Y)$. 

A multivalued tree associates with every cluster $X$ in the hierarchical classification two 
height values, $h_l(X)$ and $h_u(X)$, corresponding respectively to the lower and upper bounds 
at which member individuals can be merged into cluster $X$. When $h_l(X)$ and $h_u(X)$ 
coincide for every cluster $X$, the multivalued tree is just a valued tree. But, when there is 
any cluster $X$ for which $h_l(X) < h_u(X)$, it is like having multiple valued trees because 
every selection of a height $h(X)$ inside the interval $[h_l(X), h_u(X)]$ corresponds to a 
different valued tree. The length of the interval indicates the degree of heterogeneity inside 
cluster $X$. We also introduce here the concept of multidendrogram to refer to the graphical 
representation of a multivalued tree. In Figure 3 we show the corresponding multidendrogram 
for the toy example. The shadowed region between heights 2 and 4 refers to the interval 
between the respective values of $h_l$ and $h_u$ for cluster 
$\{x_1\} \cup \{x_2\} \cup \{x_3\}$, which in turn also correspond to the minimum and maximum 
distances separating any two of the constituent clusters $\{x_1\}$, $\{x_2\}$ and $\{x_3\}$. 

Let us consider the situation shown in Figure 4, where nine different clusters are to be 
grouped into superclusters. The clusters to be amalgamated should be those separated by the 
shortest distance. The picture shows the edges connecting clusters separated by such shortest 
distance, so we observe that there are six pairs of clusters separated by shortest edges. A 
pair-group clustering algorithm typically would select any of these pairs, for instance 
$(X_8, X_9)$, and then it would compute the distance between the new supercluster 
$X_8 \cup X_9$ and the rest of the clusters $X_i$, for all $i \in \{1, 2, \ldots, 7\}$. What 
we propose here is to follow a variable-group criterion and create as many superclusters as 
groups of clusters connected by shortest edges. In Figure 4, for instance, the nine initial 
clusters would be grouped into the four following superclusters: $X_1$, $X_2 \cup X_3$, 
$X_4 \cup X_5 \cup X_6$ and $X_7 \cup X_8 \cup X_9$. Then, all the pairwise distances between 
the four superclusters should be computed. In general, we must be able to compute distances 
$D(X_I, X_J)$ between any two superclusters $X_I = \bigcup_{i \in I} X_i$ and 
$X_J = \bigcup_{j \in J} X_j$, each one of them made up of several clusters indexed by 
$I = \{i_1, i_2, \ldots, i_p\}$ and $J = \{j_1, j_2, \ldots, j_q\}$, respectively. 
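
Grouping clusters "connected by shortest edges" amounts to finding the connected components of the graph whose edges are the tied shortest distances. The following sketch shows one way to do this with a small union-find structure. Since Figure 4 is not reproduced here, the edge list is hypothetical: it is merely one set of six edges consistent with the four superclusters named in the text, and `tie_groups` is an illustrative name.

```python
def tie_groups(n_clusters, shortest_edges):
    """Group cluster labels 1..n_clusters into connected components of the tie graph."""
    parent = list(range(n_clusters + 1))          # index 0 is unused

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]         # path halving
            a = parent[a]
        return a

    for a, b in shortest_edges:
        parent[find(a)] = find(b)                 # union the two components

    groups = {}
    for c in range(1, n_clusters + 1):
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())

# Hypothetical edge list: six shortest edges consistent with the grouping in the text.
edges = [(2, 3), (4, 5), (5, 6), (7, 8), (8, 9), (7, 9)]
superclusters = tie_groups(9, edges)
print(superclusters)   # [[1], [2, 3], [4, 5, 6], [7, 8, 9]]
```

Clusters that touch no shortest edge, such as $X_1$ here, come out as superclusters with a single component.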

The algorithm that we propose in order to ensure uniqueness in agglomerative hierarchical 
clustering has the following steps: 

0) Initialize $n$ singleton clusters with one individual in each of them: $\{x_1\}, \{x_2\}, 
\ldots, \{x_n\}$. Initialize also the distances between clusters, $D(\{x_i\}, \{x_j\})$, with 
the values of the distances between individuals, $d(x_i, x_j)$: 

$$D(\{x_i\}, \{x_j\}) = d(x_i, x_j) \quad \forall i, j = 1, 2, \ldots, n.$$

1) Find the shortest distance separating two different clusters, and record it as 
$D_{\mathrm{lower}}$. 

2) Select all the groups of clusters separated by shortest distance $D_{\mathrm{lower}}$ and 
merge them into several new superclusters $X_I$. The result of this step can be some 
superclusters made up of just one single cluster ($|I| = 1$), as well as some superclusters 
made up of various clusters ($|I| > 1$). Notice that the latter superclusters all must satisfy 
the condition $D_{\min}(X_I) = D_{\mathrm{lower}}$, where 

$$D_{\min}(X_I) = \min_{i \in I} \min_{i' \in I,\, i' \neq i} D(X_i, X_{i'}).$$

3) Update the distances between clusters following the next substeps: 

3.1) Compute the distances $D(X_I, X_J)$ between all superclusters, and record the minimum 
of them as $D_{\mathrm{next}}$ (this will be the shortest distance $D_{\mathrm{lower}}$ in the 
next iteration of the algorithm). 

3.2) For each supercluster $X_I$ made up of various clusters ($|I| > 1$), assign a common 
amalgamation interval $[D_{\mathrm{lower}}, D_{\mathrm{upper}}]$ for all its constituent 
clusters $X_i$, $i \in I$, where $D_{\mathrm{upper}} = D_{\max}(X_I)$ and 

$$D_{\max}(X_I) = \max_{i \in I} \max_{i' \in I,\, i' \neq i} D(X_i, X_{i'}).$$

4) If all individuals are not in a single cluster yet, then go back to step 1. 
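
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it fixes the unweighted average criterion, recomputes cluster distances directly from the individual distances instead of using a recursive update, and uses exact fractions so that ties are compared without rounding. The name `variable_group` and the returned triples are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def variable_group(d, n):
    """d maps frozenset({a, b}) to the distance between individuals a, b in 1..n.
    Returns (members, D_lower, D_upper) for each supercluster with |I| > 1."""
    def dist(a, b):                                    # unweighted average distance
        return Fraction(sum(d[frozenset({x, y})] for x in a for y in b),
                        len(a) * len(b))

    clusters = [frozenset({i}) for i in range(1, n + 1)]          # step 0
    history = []
    while len(clusters) > 1:                                      # step 4
        D = {frozenset({a, b}): dist(a, b) for a, b in combinations(clusters, 2)}
        d_lower = min(D.values())                                 # step 1
        # Step 2: connected components of the graph of tied shortest distances.
        group = {c: {c} for c in clusters}
        for pair, value in D.items():
            if value == d_lower:
                a, b = pair
                merged = group[a] | group[b]
                for c in merged:
                    group[c] = merged
        new_clusters = []
        for comp in {frozenset(g) for g in group.values()}:
            union = frozenset().union(*comp)
            if len(comp) > 1:                                     # step 3.2
                d_upper = max(D[frozenset({a, b})] for a, b in combinations(comp, 2))
                history.append((sorted(union), d_lower, d_upper))
            new_clusters.append(union)
        clusters = new_clusters            # step 3.1 is done at the top of the next pass
    return history

# Toy example: the six distances quoted in the text for Figure 1.
toy = {frozenset({1, 2}): 2, frozenset({2, 3}): 2, frozenset({1, 3}): 4,
       frozenset({3, 4}): 3, frozenset({2, 4}): 5, frozenset({1, 4}): 7}
for members, lower, upper in variable_group(toy, 4):
    print(members, float(lower), float(upper))
```

On the toy data the first supercluster is $\{x_1, x_2, x_3\}$ with amalgamation interval $[2, 4]$, and it then joins $x_4$ at distance 5, in agreement with the multidendrogram described for Figure 3.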

Using the pair-group algorithm, only the centroid methods (weighted and unweighted) 
may produce reversals. Recall that a reversal arises in a valued tree when it 
contains at least two clusters $X$ and $Y$ for which $X \subset Y$ but $h(X) > h(Y)$ (Morgan 
and Ray 1995). In the case of the variable-group algorithm, reversals may appear in substep 3.2. 
Although reversals make dendrograms difficult to interpret if they occur during the last stages 
of the agglomeration process, it can be argued that they are not very disturbing if they occur 
during the first stages. Thus, as happens with the centroid methods in the pair-group case, it 
could be reasonable to use the variable-group algorithm as long as no reversals at all or only 
unimportant ones were produced. 

Sometimes, in substep 3.2 of the variable-group clustering algorithm, it will not be enough 
to adopt a fusion interval, but it will be necessary to obtain an exact fusion value (e.g., in 
order to calculate a distortion measure). In these cases, given the lower and upper bounds at 
which the tied clusters can merge into a supercluster, one possibility is to select the fusion 
value naturally suggested by the method being applied. For instance, in the case of the 
toy example and the corresponding multidendrogram shown in Figures 1 and 3, the fusion 
value would be 2.7 (the unweighted average distance). If the clustering method used was a 
different one such as single linkage or complete linkage, then the fusion value would be 2 or 
4, respectively. Another possibility is to systematically use the shortest distance as the 
fusion value for the tied clusters. Both criteria recover the pair-group result for the 
single linkage method. The latter criterion, in addition, avoids the appearance of reversals. 
However, it must be emphasized that the adoption of exact fusion values, without considering 
the fusion intervals at their whole lengths, means that some valuable information regarding 
the heterogeneity of the clusters is being lost. 
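
As a small worked check of the fusion values quoted above, the sketch below computes them for the tied cluster of the toy example, whose constituent clusters are singletons. The dictionary `within` simply holds the three distances given in the text; taking the mean of all pairwise distances as the unweighted average fusion value is an assumption that matches the quoted 2.7 in this singleton case.

```python
from statistics import mean

# Pairwise distances inside the tied cluster {x1, x2, x3} of the toy example.
within = {("x1", "x2"): 2, ("x2", "x3"): 2, ("x1", "x3"): 4}

fusion = {
    "single linkage": min(within.values()),       # lower bound of the interval
    "complete linkage": max(within.values()),     # upper bound of the interval
    "unweighted average": mean(within.values()),  # 8/3, quoted as 2.7 in the text
}
for method, value in fusion.items():
    print(f"{method}: {value:.1f}")
```

All three values lie inside the fusion interval $[2, 4]$, which is the information that a single fusion value discards.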



3 Generalization of Agglomerative Hierarchical Methods 

In the variable-group clustering algorithm previously proposed we have seen the necessity of 
agglomerating simultaneously two families of clusters, respectively indexed by 
$I = \{i_1, i_2, \ldots, i_p\}$ and $J = \{j_1, j_2, \ldots, j_q\}$, into two superclusters 
$X_I = \bigcup_{i \in I} X_i$ and $X_J = \bigcup_{j \in J} X_j$. In the following subsections 
we derive, for each of the most commonly used agglomerative hierarchical clustering 
strategies, the distance between the two superclusters, $D(X_I, X_J)$, in terms of the 
distances between the respective component clusters, $D(X_i, X_j)$. 



3.1 Single Linkage 

In single linkage clustering, also called nearest neighbor or minimum method, the distance 
between two clusters $X_i$ and $X_j$ is defined as the distance between the closest pair of 
individuals, one in each cluster: 

$$D(X_i, X_j) = \min_{x \in X_i} \min_{y \in X_j} d(x, y). \tag{1}$$

This means that the distance between two superclusters $X_I$ and $X_J$ can be defined as 

$$D(X_I, X_J) = \min_{x \in X_I} \min_{y \in X_J} d(x, y) 
= \min_{i \in I} \min_{x \in X_i} \min_{j \in J} \min_{y \in X_j} d(x, y). \tag{2}$$

Notice that this formulation generalizes the definition of distance between clusters in the 
sense that equation (1) is recovered from equation (2) when $|I| = |J| = 1$, that is, when 
superclusters $I$ and $J$ are both composed of a single cluster. Grouping terms and using the 
definition in equation (1), we get the equivalent definition: 

$$D(X_I, X_J) = \min_{i \in I} \min_{j \in J} D(X_i, X_j). \tag{3}$$



3.2 Complete Linkage 

In complete linkage clustering, also known as furthest neighbor or maximum method, cluster 
distance is defined as the distance between the most remote pair of individuals, one in each 
cluster: 

$$D(X_i, X_j) = \max_{x \in X_i} \max_{y \in X_j} d(x, y). \tag{4}$$

Starting from equation (4) and following the same reasoning as in the single linkage case, we 
extend the definition of distance to the superclusters case: 

$$D(X_I, X_J) = \max_{i \in I} \max_{j \in J} D(X_i, X_j). \tag{5}$$
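
Equations (3) and (5) say that the single and complete linkage distances between superclusters can be obtained from the distances between component clusters alone. The following sketch checks this numerically on made-up data: the eight one-dimensional "individuals" and the particular superclusters are hypothetical, and the function names are illustrative.

```python
import random
from itertools import product

random.seed(0)
points = {k: random.random() for k in range(8)}       # hypothetical 1-D individuals

def d(x, y):
    return abs(points[x] - points[y])

def D_single(a, b):                                   # equation (1)
    return min(d(x, y) for x, y in product(a, b))

def D_complete(a, b):                                 # equation (4)
    return max(d(x, y) for x, y in product(a, b))

def flat(supercluster):
    return [x for cluster in supercluster for x in cluster]

# Two superclusters, each given as a list of component clusters.
X_I = [[0, 1], [2]]
X_J = [[3], [4, 5], [6, 7]]

single_recursive = min(D_single(a, b) for a, b in product(X_I, X_J))      # equation (3)
complete_recursive = max(D_complete(a, b) for a, b in product(X_I, X_J))  # equation (5)

print(single_recursive == D_single(flat(X_I), flat(X_J)))
print(complete_recursive == D_complete(flat(X_I), flat(X_J)))
```

Both comparisons hold exactly, because a minimum of minima (or a maximum of maxima) selects one of the original distances unchanged.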



3.3 Unweighted Average 

Unweighted average clustering, also known as group average method or UPGMA (Unweighted 
Pair-Group Method using Averages), iteratively forms clusters made up of pairs of previously 
formed clusters, based on the arithmetic mean distances between their member individuals. 
It uses an unweighted averaging procedure, that is, when clusters are joined to form a larger 
cluster, the distance between this new cluster and any other cluster is calculated weighting 
each individual in those clusters equally, regardless of the structural subdivision of the clusters: 

$$D(X_i, X_j) = \frac{1}{|X_i||X_j|} \sum_{x \in X_i} \sum_{y \in X_j} d(x, y). \tag{6}$$

When the variable-group strategy is followed, the UPGMA name of the method should be 
modified to that of UVGMA (Unweighted Variable-Group Method using Averages), and the 
distance definition between superclusters in this case should be 

$$D(X_I, X_J) = \frac{1}{|X_I||X_J|} \sum_{x \in X_I} \sum_{y \in X_J} d(x, y) 
= \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{x \in X_i} \sum_{j \in J} \sum_{y \in X_j} d(x, y).$$

Using equation (6), we get the desired definition in terms of the distances between component 
clusters: 

$$D(X_I, X_J) = \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| D(X_i, X_j). \tag{7}$$

In this case, $|X_I|$ is the number of individuals in supercluster $X_I$, that is, 
$|X_I| = \sum_{i \in I} |X_i|$. 
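
Equation (7) can be checked against the direct definition on a small example. The sketch below uses random integer distances between seven hypothetical individuals and exact fractions, so the two sides can be compared with `==`. The data and the variable names (`eq7`, `direct`) are illustrative only.

```python
import random
from fractions import Fraction
from itertools import product

random.seed(1)
ids = range(7)
d = {frozenset({x, y}): random.randint(1, 9) for x in ids for y in ids if x < y}

def D(a, b):
    """Equation (6): mean distance over all pairs of individuals, one per cluster."""
    total = sum(d[frozenset({x, y})] for x, y in product(a, b))
    return Fraction(total, len(a) * len(b))

X_I = [[0, 1, 2], [3]]            # component clusters of supercluster X_I
X_J = [[4], [5, 6]]               # component clusters of supercluster X_J
n_I = sum(len(c) for c in X_I)
n_J = sum(len(c) for c in X_J)

# Equation (7): size-weighted combination of the component-cluster distances.
eq7 = sum(len(a) * len(b) * D(a, b) for a, b in product(X_I, X_J)) / (n_I * n_J)

# Direct definition: average over all individuals of the flattened superclusters.
direct = D([x for c in X_I for x in c], [x for c in X_J for x in c])
print(eq7 == direct)
```

The weights $|X_i||X_j|$ are what make every individual count equally, regardless of how the superclusters are subdivided.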



3.4 Weighted Average 

In weighted average strategy, also called WVGMA (Weighted Variable-Group Method using 
Averages) in substitution of the corresponding pair-group name WPGMA, we calculate the 
distance between two superclusters $X_I$ and $X_J$ by taking the arithmetic mean of the 
pairwise distances, not between individuals in the original matrix of distances, but between 
component clusters in the matrix used in the previous iteration of the procedure: 

$$D(X_I, X_J) = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} D(X_i, X_j). \tag{8}$$

This method is related to the unweighted average one in that the former derives from the 
latter when we consider 

$$|X_i| = 1 \quad \forall i \in I \qquad \text{and} \qquad |X_j| = 1 \quad \forall j \in J. \tag{9}$$

It weights the most recently admitted individuals in a cluster equally to its previous members. 
The weighting discussed here is with reference to individuals composing a cluster and not 
to the average distances in Lance and Williams' recursive formula (see next section), in 
which equal weights apply for weighted clustering and different weights apply for unweighted 
clustering (Sneath and Sokal 1973, p. 229). 
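
The difference between equations (7) and (8) shows up as soon as the component clusters have unequal sizes. In the sketch below the distance matrix and the cluster sizes are made-up numbers chosen for illustration: a component with three individuals lies at distance 2 from $X_J$, and a singleton component lies at distance 6.

```python
def weighted_average(D):
    """Equation (8): plain mean of the |I| x |J| matrix of component distances."""
    return sum(sum(row) for row in D) / (len(D) * len(D[0]))

def unweighted_average(D, sizes_I, sizes_J):
    """Equation (7): mean weighted by the sizes of the component clusters."""
    total = sum(si * sj * D[a][b]
                for a, si in enumerate(sizes_I)
                for b, sj in enumerate(sizes_J))
    return total / (sum(sizes_I) * sum(sizes_J))

D = [[2.0], [6.0]]                 # hypothetical distances D(X_i, X_j), |I| = 2, |J| = 1
wvgma = weighted_average(D)
uvgma = unweighted_average(D, sizes_I=[3, 1], sizes_J=[1])
print(wvgma, uvgma)                # 4.0 3.0
```

Setting every size to 1, as in relation (9), makes the two functions agree, which is the sense in which the weighted method derives from the unweighted one.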



3.5 Unweighted Centroid 

The next three clustering techniques assume that individuals can be represented by points 
in Euclidean space. This method and the next one further assume that the measure of 
dissimilarity between any pair of individuals is the squared Euclidean distance between the 
corresponding pair of points. When the dissimilarity between two clusters $X_i$ and $X_j$ is 
defined to be the squared distance between their centroids, we are performing unweighted 
centroid (or simply centroid) clustering, also called UPGMC (Unweighted Pair-Group Method 
using Centroids): 

$$D(X_i, X_j) = \|\bar{x}_i - \bar{x}_j\|^2, \tag{10}$$

where $\bar{x}_i$ and $\bar{x}_j$ are the centroids of the points in clusters $X_i$ and $X_j$ 
respectively, and $\|\cdot\|$ is the Euclidean norm. Therefore, under the variable-group point 
of view, the method could be named UVGMC and the distance between two superclusters can be 
generalized to the definition: 

$$D(X_I, X_J) = \|\bar{x}_I - \bar{x}_J\|^2. \tag{11}$$

In the Appendix it is proved that this definition can be expressed in terms of equation (10) 
as 

$$\begin{aligned}
D(X_I, X_J) = {} & \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| D(X_i, X_j) \\
& - \frac{1}{|X_I|^2} \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} |X_i||X_{i'}| D(X_i, X_{i'}) \\
& - \frac{1}{|X_J|^2} \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} |X_j||X_{j'}| D(X_j, X_{j'}).
\end{aligned} \tag{12}$$
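
A numerical check of the identity behind equation (12): the squared distance between supercluster centroids equals the size-weighted mean of the between-component distances minus the size-weighted within-supercluster terms. The sketch below uses random points in the plane as hypothetical data and compares the two sides up to floating-point tolerance. Function names are illustrative.

```python
import random
from itertools import combinations, product

random.seed(2)

def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def D(a, b):
    """Equation (10): squared Euclidean distance between the centroids of a and b."""
    return sum((u - v) ** 2 for u, v in zip(centroid(a), centroid(b)))

def rand_cluster(size):
    return [(random.random(), random.random()) for _ in range(size)]

def within(supercluster):
    """Size-weighted sum of distances between components of one supercluster."""
    return sum(len(a) * len(b) * D(a, b) for a, b in combinations(supercluster, 2))

X_I = [rand_cluster(s) for s in (3, 1, 2)]      # hypothetical component clusters
X_J = [rand_cluster(s) for s in (2, 4)]
n_I = sum(len(c) for c in X_I)
n_J = sum(len(c) for c in X_J)

between = sum(len(a) * len(b) * D(a, b) for a, b in product(X_I, X_J))
eq12 = between / (n_I * n_J) - within(X_I) / n_I ** 2 - within(X_J) / n_J ** 2

# Equation (11) evaluated directly on the flattened superclusters.
direct = D([p for c in X_I for p in c], [p for c in X_J for p in c])
print(abs(eq12 - direct) < 1e-12)
```

The two negative terms are what can make centroid distances shrink after a merge, which is the source of the reversals mentioned in Section 2.2.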



3.6 Weighted Centroid 

In weighted centroid strategy, also called median method or WVGMC (Weighted Variable-Group 
Method using Centroids) in substitution of the pair-group name WPGMC, we modify 
the definition of dissimilarity between two clusters given in the unweighted centroid case, 
assigning each cluster the same weight in calculating the "centroid". Now the center of a 
supercluster $X_I$ is the average of the centers of the constituent clusters: 

$$\bar{x}_I = \frac{1}{|I|} \sum_{i \in I} \bar{x}_i.$$

This clustering method is related to the unweighted centroid one by relation (9), which also 
related the weighted average strategy to the corresponding unweighted average. So, in this 
case we define the distance between two superclusters as 

$$\begin{aligned}
D(X_I, X_J) = {} & \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} D(X_i, X_j) \\
& - \frac{1}{|I|^2} \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} D(X_i, X_{i'}) 
- \frac{1}{|J|^2} \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} D(X_j, X_{j'}).
\end{aligned} \tag{13}$$


3.7 Joint Between-Within 

Szekely and Rizzo (2005) propose an agglomerative hierarchical clustering method that mini- 
mizes a joint between-within cluster distance, measuring both heterogeneity between clusters 
and homogeneity within clusters. This method extends Ward's minimum variance method 
(Ward 1963) by defining the distance between two clusters Xi and Xj in terms of any power 
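$\alpha \in (0, 2]$ of Euclidean distances between individuals: 

$$\begin{aligned}
D(X_i, X_j) = \frac{|X_i||X_j|}{|X_i| + |X_j|} \Bigg( 
& \frac{2}{|X_i||X_j|} \sum_{x \in X_i} \sum_{y \in X_j} \|x - y\|^{\alpha} \\
& - \frac{1}{|X_i|^2} \sum_{x \in X_i} \sum_{x' \in X_i} \|x - x'\|^{\alpha} 
- \frac{1}{|X_j|^2} \sum_{y \in X_j} \sum_{y' \in X_j} \|y - y'\|^{\alpha} \Bigg).
\end{aligned} \tag{14}$$

When $\alpha = 2$, cluster distances are a weighted squared distance between cluster centers 

$$D(X_i, X_j) = \frac{2|X_i||X_j|}{|X_i| + |X_j|} \|\bar{x}_i - \bar{x}_j\|^2,$$

equal to twice the cluster distance that is used in Ward's method. 

In the Appendix we derive the following recursive formula for updating cluster distances as a 
generalization of equation (14): 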

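$$\begin{aligned}
D(X_I, X_J) = {} & \frac{1}{|X_I| + |X_J|} \sum_{i \in I} \sum_{j \in J} (|X_i| + |X_j|) D(X_i, X_j) \\
& - \frac{|X_J|}{|X_I| (|X_I| + |X_J|)} \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} (|X_i| + |X_{i'}|) D(X_i, X_{i'}) \\
& - \frac{|X_I|}{|X_J| (|X_I| + |X_J|)} \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} (|X_j| + |X_{j'}|) D(X_j, X_{j'}).
\end{aligned} \tag{15}$$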
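
The recursive formula (15) can be checked against the direct definition (14) on random data. The sketch below is only a numerical sanity check under stated assumptions: the points are hypothetical, the power `ALPHA = 1.0` is an arbitrary choice within $(0, 2]$, and the helper names are illustrative.

```python
import random
from itertools import combinations, product

random.seed(3)
ALPHA = 1.0   # any power in (0, 2]; chosen arbitrarily for this check

def S(a, b):
    """Sum of ||x - y||^ALPHA over all x in a and y in b."""
    return sum(sum((u - v) ** 2 for u, v in zip(x, y)) ** (ALPHA / 2)
               for x, y in product(a, b))

def D(a, b):
    """Equation (14): joint between-within distance between clusters a and b."""
    n, m = len(a), len(b)
    return n * m / (n + m) * (2 * S(a, b) / (n * m) - S(a, a) / n ** 2 - S(b, b) / m ** 2)

def weighted_pairs(pairs):
    """Sum of (|X_a| + |X_b|) D(X_a, X_b) over the given pairs of clusters."""
    return sum((len(a) + len(b)) * D(a, b) for a, b in pairs)

def rand_cluster(size):
    return [(random.random(), random.random()) for _ in range(size)]

X_I = [rand_cluster(s) for s in (2, 3)]         # hypothetical component clusters
X_J = [rand_cluster(s) for s in (1, 2, 2)]
n_I = sum(len(c) for c in X_I)
n_J = sum(len(c) for c in X_J)

eq15 = (weighted_pairs(product(X_I, X_J)) / (n_I + n_J)
        - n_J / (n_I * (n_I + n_J)) * weighted_pairs(combinations(X_I, 2))
        - n_I / (n_J * (n_I + n_J)) * weighted_pairs(combinations(X_J, 2)))

# Equation (14) evaluated directly on the flattened superclusters.
direct = D([p for c in X_I for p in c], [p for c in X_J for p in c])
print(abs(eq15 - direct) < 1e-12)
```

Changing `ALPHA` to 2.0 gives twice Ward's distance between the flattened superclusters, as stated above for the pair-group case.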
4 Generalization of Lance and Williams' Formula 



Lance and Williams (1966) put the most commonly used agglomerative hierarchical strate- 
gies into a single system, avoiding the necessity of a separate computer program for each 
of them. Assume three clusters $X_i$, $X_{i'}$ and $X_j$, containing $|X_i|$, $|X_{i'}|$ and 
$|X_j|$ individuals respectively and with distances between them already determined as 
$D(X_i, X_{i'})$, $D(X_i, X_j)$ and $D(X_{i'}, X_j)$. Further assume that the smallest of all 
distances still to be considered is $D(X_i, X_{i'})$, so that $X_i$ and $X_{i'}$ are joined to 
form a new supercluster $X_i \cup X_{i'}$, with $|X_i| + |X_{i'}|$ individuals. Lance and 
Williams express $D(X_i \cup X_{i'}, X_j)$ in terms of the distances already defined, all 
known at the moment of fusion, using the following recurrence relation: 

$$\begin{aligned}
D(X_i \cup X_{i'}, X_j) = {} & \alpha_i D(X_i, X_j) + \alpha_{i'} D(X_{i'}, X_j) \\
& + \beta D(X_i, X_{i'}) + \gamma\, |D(X_i, X_j) - D(X_{i'}, X_j)|.
\end{aligned} \tag{16}$$

With this technique superclusters can always be computed from previous clusters and it is not 
necessary to return to the original dissimilarity data during the clustering process. The 
values of the parameters $\alpha_i$, $\alpha_{i'}$, $\beta$ and $\gamma$ determine the nature 
of the sorting strategy. Table 1 gives the values of the parameters that define the most 
commonly used agglomerative hierarchical clustering methods. 
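
As an illustration of recurrence (16), the sketch below applies it to the toy example when $\{x_1\}$ and $\{x_2\}$ are merged and the distance to $\{x_4\}$ is updated. Table 1 is not reproduced in this section, so the parameter values in `params` are an assumption: they are the standard textbook values for single linkage, complete linkage and unweighted average. The function names are hypothetical.

```python
def lance_williams(d_ij, d_i2j, d_ii2, alpha_i, alpha_i2, beta, gamma):
    """Recurrence (16): distance from the merged cluster X_i U X_i' to X_j."""
    return alpha_i * d_ij + alpha_i2 * d_i2j + beta * d_ii2 + gamma * abs(d_ij - d_i2j)

def params(method, n_i, n_i2):
    """Assumed standard parameters (alpha_i, alpha_i', beta, gamma) for three methods."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "unweighted average":
        n = n_i + n_i2
        return n_i / n, n_i2 / n, 0.0, 0.0
    raise ValueError(f"unknown method: {method}")

# Toy example: D({x1},{x4}) = 7, D({x2},{x4}) = 5, D({x1},{x2}) = 2.
d_14, d_24, d_12 = 7, 5, 2
merged = {m: lance_williams(d_14, d_24, d_12, *params(m, 1, 1))
          for m in ("single", "complete", "unweighted average")}
print(merged)   # single -> 5.0, complete -> 7.0, unweighted average -> 6.0
```

The results coincide with the minimum, the maximum and the mean of 7 and 5, as the direct definitions of the three methods require.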

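We next give a generalization of formula (16) compatible with the amalgamation of more 
than two clusters simultaneously. Suppose that one wants to agglomerate two superclusters 
$X_I$ and $X_J$, respectively indexed by $I = \{i_1, i_2, \ldots, i_p\}$ and 
$J = \{j_1, j_2, \ldots, j_q\}$. We define the distance between them as 

$$\begin{aligned}
D(X_I, X_J) = {} & \sum_{i \in I} \sum_{j \in J} \alpha_{ij} D(X_i, X_j) \\
& + \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} \beta_{ii'} D(X_i, X_{i'}) 
+ \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} \beta_{jj'} D(X_j, X_{j'}) \\
& + \delta \sum_{i \in I} \sum_{j \in J} \gamma_{ij} \left[ D_{\max}(X_I, X_J) - D(X_i, X_j) \right] \\
& - (1 - \delta) \sum_{i \in I} \sum_{j \in J} \gamma_{ij} \left[ D(X_i, X_j) - D_{\min}(X_I, X_J) \right],
\end{aligned} \tag{17}$$

where 

$$D_{\max}(X_I, X_J) = \max_{i \in I} \max_{j \in J} D(X_i, X_j)$$

and 

$$D_{\min}(X_I, X_J) = \min_{i \in I} \min_{j \in J} D(X_i, X_j).$$
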

Table 2 shows the values for the parameters $\alpha_{ij}$, $\beta_{ii'}$, $\beta_{jj'}$, 
$\gamma_{ij}$ and $\delta$ which determine the clustering method computed by formula (17). 
They are all gathered from the respective formulae (3), (5), (7), (8), (12), (13) and (15), 
derived in the previous section. 

5 Glamorganshire Soils Example 

We show here a real example which has been studied by Morgan and Ray (1995) using the 
complete linkage method. It is the Glamorganshire soils example, formed by similarity data 
between 23 different soils. A table with the similarities can be found also in Morgan and 
Ray (1995), where the values are given with an accuracy of three decimal places. In order 
to work with dissimilarities, first of all we have transformed the similarities s(xi,Xj) into the 
corresponding dissimilarities d(xi,Xj) = 1 — s(xi,Xj). 

The original data present a tied value for pairs of soils (3,15) and (3,20), which is respon- 
sible for two different dendrograms using the complete linkage strategy. We show them in 
Figures 5 and 6. Morgan and Ray (1995) explain that the 23 soils have been categorized into 
eight "great soil groups" by a surveyor. Focusing on soils 1, 2, 6, 12 and 13, which are the 
only members of the brown earths soil group, we see that the dendrogram in Figure 5 does 
not place them in the same cluster until they join soils from five other soil groups, forming 
the cluster (1, 2, 3, 20, 12, 13, 15, 5, 6, 8, 14, 18). From this point of view, the dendrogram in 
Figure 6 is better, since the corresponding cluster loses soils 8, 14 and 18, each representing 
a different soil group. So, in this case, there are two possible solution dendrograms and, if 
the tie is broken arbitrarily, the probability of obtaining the "good" one is only 50%. 

On the other hand, in Figure 7 we can see the multidendrogram corresponding to the 
Glamorganshire soils data. The existence of a tie comprising soils 3, 15 and 20 is clear from 
this tree representation. Besides, the multidendrogram gives us the good classification, that 
is, the one with soils 8, 14 and 18 out of the brown earths soil group. Except for the internal 
structure of the cluster (1, 2, 3, 15, 20), the rest of the multidendrogram hierarchy coincides 
with that of the dendrogram shown in Figure 6. 

Finally, notice that the incidence of ties depends on the accuracy with which proximity 
values are available. In this example, if dissimilarities had been measured to four decimal 
places, then the tie causing the non-unique complete linkage dendrogram might have 
disappeared. Conversely, the probability of ties is higher if lower accuracy data are used. 
For instance, when we consider the same soils data but with an accuracy of only two decimal 
places, we obtain the multidendrogram shown in Figure 8, where three different ties can be 
observed. 

6 Conclusions 

The non-uniqueness problem in agglomerative hierarchical clustering generates several hierar- 
chical classifications from a unique set of tied proximity data. In such cases, selecting a unique 
classification can be misleading. This problem has traditionally been dealt with using distinct 
criteria, which mostly consist of selecting one of the various resulting hierarchies. In this 
article we have proposed a variable-group algorithm for agglomerative hierarchical clustering 
that solves the ties in proximity problem. The output of this algorithm is a uniquely deter- 
mined type of valued tree, which we call a multivalued tree, while graphically we represent it 
with a multidendrogram. 

In addition we have generalized the definition of distance between clusters for the most 
commonly used agglomerative hierarchical methods, in order to be able to compute them 
using the variable-group algorithm. We have also given the corresponding generalization of 
Lance and Williams' formula, which enables us to get agglomerative hierarchical classifications 
in a recursive way. Finally, we have shown the possible usefulness of our proposal with some
results obtained using data from a real example.

Gathering up the main advantages of our new proposal, we can state the following points:

• When there are no ties, the variable-group algorithm gives the same result as the
pair-group one.

• The new algorithm always gives a uniquely determined solution.

• In the multidendrogram representation for the results one can explicitly observe the
occurrence of ties during the agglomerative process. Furthermore, the height of any
fusion interval indicates the degree of heterogeneity inside the corresponding cluster.

• When ties exist, the variable-group algorithm is computationally more efficient than
obtaining all the possible solutions following out the various ties with the pair-group
alternative.

• The new proposal can also be computed in a recursive way using a generalization of
Lance and Williams' formula.

Although ties need not be present in the initial proximity data, they may arise during 
the agglomeration process. For this reason, and given that the results of the variable-group
algorithm coincide with those of the pair-group algorithm when there are no ties, we
recommend using the variable-group option directly. A single run reveals whether
ties exist and, at the same time, yields the resulting solution.
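
The behaviour summarized above can be sketched in a few lines for the complete linkage case. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the four-individual data set are hypothetical, distances between clusters are recomputed directly as the largest dissimilarity between their members, and the upper end of each fusion interval is taken to be the largest distance among the clusters merged in that step.

```python
from itertools import combinations


def complete_linkage(a, b, dist):
    """Complete linkage: largest dissimilarity between members of clusters a and b."""
    return max(dist[frozenset((x, y))] for x in a for y in b)


def variable_group_clustering(dist):
    """Merge, at every step, all clusters tied at the minimum distance.

    dist maps frozenset({x, y}) to the dissimilarity between individuals x and y.
    Returns one (members, lower, upper) tuple per supercluster formed.
    """
    individuals = sorted({x for pair in dist for x in pair})
    clusters = [frozenset([x]) for x in individuals]
    merges = []
    while len(clusters) > 1:
        pair_d = {frozenset((a, b)): complete_linkage(a, b, dist)
                  for a, b in combinations(clusters, 2)}
        lower = min(pair_d.values())
        # Connected components of the graph whose edges are the tied minimum pairs.
        group_of = {c: frozenset([c]) for c in clusters}
        for pair, value in pair_d.items():
            if value == lower:  # exact comparison: inputs have fixed precision
                a, b = pair
                merged = group_of[a] | group_of[b]
                for c in merged:
                    group_of[c] = merged
        new_clusters = []
        # dict.fromkeys keeps each distinct group once, in first-seen order.
        for group in dict.fromkeys(group_of[c] for c in clusters):
            members = frozenset().union(*group)
            new_clusters.append(members)
            if len(group) > 1:
                upper = max(pair_d[frozenset(p)] for p in combinations(group, 2))
                merges.append((sorted(members), lower, upper))
        clusters = new_clusters
    return merges


# Hypothetical dissimilarities with a tie: d(a, b) = d(b, c) = 1.
toy = {frozenset("ab"): 1.0, frozenset("bc"): 1.0, frozenset("ac"): 2.0,
       frozenset("ad"): 3.0, frozenset("bd"): 3.0, frozenset("cd"): 4.0}
merges = variable_group_clustering(toy)
```

With this input a pair-group procedure would have to choose between merging (a, b) or (b, c) first, giving two different trees. The sketch instead groups a, b and c in a single step and reports the fusion interval between the tied distance and the largest distance inside the new cluster.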

Acknowledgments 

The authors thank A. Arenas for discussion and helpful comments. This work was partially 
supported by DGES of the Spanish Government Project No. FIS2006-13321-C02-02 and by 
a grant of Universitat Rovira i Virgili. 

A Appendix: Proofs 

A.1 Proof for the Unweighted Centroid Method

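Given a cluster $X_i$, its centroid is

$$\bar{x}_i = \frac{1}{|X_i|} \sum_{x \in X_i} x,$$

and the centroid of a supercluster $X_I$ can be expressed in terms of its constituent centroids
by the equation:

$$\bar{x}_I = \frac{1}{|X_I|} \sum_{i \in I} |X_i| \, \bar{x}_i.$$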

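Now, given two superclusters $X_I$ and $X_J$, the distance between them, defined for the
unweighted centroid method as the squared Euclidean distance between their centroids, is

$$D(X_I, X_J) = \|\bar{x}_I - \bar{x}_J\|^2 = \|\bar{x}_I\|^2 + \|\bar{x}_J\|^2 - 2 \langle \bar{x}_I, \bar{x}_J \rangle,$$

where $\langle \cdot, \cdot \rangle$ stands for the inner product. If we substitute each supercluster
centroid by its expression in terms of the constituent centroids, we obtain:

$$\begin{aligned}
D(X_I, X_J) ={}& \frac{1}{|X_I|^2} \sum_{i \in I} \sum_{i' \in I} |X_i||X_{i'}| \langle \bar{x}_i, \bar{x}_{i'} \rangle
 + \frac{1}{|X_J|^2} \sum_{j \in J} \sum_{j' \in J} |X_j||X_{j'}| \langle \bar{x}_j, \bar{x}_{j'} \rangle \\
& - \frac{2}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \langle \bar{x}_i, \bar{x}_j \rangle.
\end{aligned}$$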



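Now, since

$$2 \langle \bar{x}_i, \bar{x}_j \rangle = \|\bar{x}_i\|^2 + \|\bar{x}_j\|^2 - \|\bar{x}_i - \bar{x}_j\|^2,$$

we have that

$$\begin{aligned}
D(X_I, X_J) ={}& \frac{1}{|X_I|^2} \sum_{i \in I} \sum_{i' \in I} |X_i||X_{i'}| \langle \bar{x}_i, \bar{x}_{i'} \rangle
 + \frac{1}{|X_J|^2} \sum_{j \in J} \sum_{j' \in J} |X_j||X_{j'}| \langle \bar{x}_j, \bar{x}_{j'} \rangle \\
& - \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \left( \|\bar{x}_i\|^2 + \|\bar{x}_j\|^2 \right)
 + \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \|\bar{x}_i - \bar{x}_j\|^2.
\end{aligned}$$

But this can be rewritten, using $\sum_{i \in I} |X_i| = |X_I|$ and $\sum_{j \in J} |X_j| = |X_J|$, as

$$\begin{aligned}
D(X_I, X_J) ={}& \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \|\bar{x}_i - \bar{x}_j\|^2 \\
& - \frac{1}{|X_I|} \sum_{i \in I} |X_i| \|\bar{x}_i\|^2
 + \frac{1}{|X_I|^2} \sum_{i \in I} \sum_{i' \in I} |X_i||X_{i'}| \langle \bar{x}_i, \bar{x}_{i'} \rangle \\
& - \frac{1}{|X_J|} \sum_{j \in J} |X_j| \|\bar{x}_j\|^2
 + \frac{1}{|X_J|^2} \sum_{j \in J} \sum_{j' \in J} |X_j||X_{j'}| \langle \bar{x}_j, \bar{x}_{j'} \rangle
\end{aligned}$$

and, grouping terms,

$$\begin{aligned}
D(X_I, X_J) ={}& \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \|\bar{x}_i - \bar{x}_j\|^2 \\
& - \frac{1}{|X_I|^2} \sum_{i \in I} \sum_{i' \in I} |X_i||X_{i'}| \left( \|\bar{x}_i\|^2 - \langle \bar{x}_i, \bar{x}_{i'} \rangle \right) \\
& - \frac{1}{|X_J|^2} \sum_{j \in J} \sum_{j' \in J} |X_j||X_{j'}| \left( \|\bar{x}_j\|^2 - \langle \bar{x}_j, \bar{x}_{j'} \rangle \right).
\end{aligned}$$

The second and third terms can be simplified a little more, thanks to the equality

$$\sum_{i \in I} \sum_{i' \in I} |X_i||X_{i'}| \left( \|\bar{x}_i\|^2 - \langle \bar{x}_i, \bar{x}_{i'} \rangle \right)
 = \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} |X_i||X_{i'}| \|\bar{x}_i - \bar{x}_{i'}\|^2.$$

With this simplification, we have that

$$\begin{aligned}
D(X_I, X_J) ={}& \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \|\bar{x}_i - \bar{x}_j\|^2 \\
& - \frac{1}{|X_I|^2} \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} |X_i||X_{i'}| \|\bar{x}_i - \bar{x}_{i'}\|^2
 - \frac{1}{|X_J|^2} \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} |X_j||X_{j'}| \|\bar{x}_j - \bar{x}_{j'}\|^2
\end{aligned}$$

and, recalling the definition of distance between two clusters,
$D(X_i, X_j) = \|\bar{x}_i - \bar{x}_j\|^2$, we finally obtain the desired variable-group
form of the distance for the unweighted centroid method.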
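
As a sanity check on the result of this proof, the identity can be verified numerically. The sketch below is only illustrative: the cluster sizes, the random points and all function names are assumptions made here, not part of the paper. It compares the squared distance between the two supercluster centroids with the expression built from the distances between the constituent clusters.

```python
import random


def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def centroid_distance(a, b):
    """Unweighted centroid distance: squared Euclidean distance between centroids."""
    ca, cb = centroid(a), centroid(b)
    return sum((u - v) ** 2 for u, v in zip(ca, cb))


def random_cluster(rng, size, dim=2):
    """Hypothetical cluster of random points in the unit square."""
    return [[rng.random() for _ in range(dim)] for _ in range(size)]


def weighted_within(clusters):
    """Sum over pairs i < i' of |X_i| |X_i'| D(X_i, X_i')."""
    return sum(len(a) * len(b) * centroid_distance(a, b)
               for k, a in enumerate(clusters) for b in clusters[k + 1:])


rng = random.Random(0)
clusters_i = [random_cluster(rng, s) for s in (2, 3, 1)]  # the X_i, i in I
clusters_j = [random_cluster(rng, s) for s in (4, 2)]     # the X_j, j in J
super_i = [p for c in clusters_i for p in c]              # supercluster X_I
super_j = [p for c in clusters_j for p in c]              # supercluster X_J

# Left-hand side: distance computed directly from the supercluster centroids.
direct = centroid_distance(super_i, super_j)

# Right-hand side: the three sums over constituent clusters.
between = sum(len(a) * len(b) * centroid_distance(a, b)
              for a in clusters_i for b in clusters_j)
recursive = (between / (len(super_i) * len(super_j))
             - weighted_within(clusters_i) / len(super_i) ** 2
             - weighted_within(clusters_j) / len(super_j) ** 2)
```

Both quantities should agree up to floating-point rounding, whatever the cluster sizes chosen.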

A.2 Proof for the Joint Between-Within Method

We give here a proof based on that of Szekely and Rizzo (2005) for their agglomerative
hierarchical formulation. Using the following constants:

$$\theta_{ij} = \frac{1}{|X_i||X_j|} \sum_{x \in X_i} \sum_{x' \in X_j} \|x - x'\|^{\alpha},$$

the definition of distance between two clusters $X_i$ and $X_j$ is

$$D(X_i, X_j) = \frac{|X_i||X_j|}{|X_i| + |X_j|} \left( 2\theta_{ij} - \theta_{ii} - \theta_{jj} \right).$$

Consider now the superclusters $X_I$ and $X_J$ formed by merging clusters $X_i$, for all $i \in I$, and
$X_j$, for all $j \in J$. Define the corresponding constants:

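$$\theta_{IJ} = \frac{1}{|X_I||X_J|} \sum_{x \in X_I} \sum_{x' \in X_J} \|x - x'\|^{\alpha}
 = \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} \sum_{x \in X_i} \sum_{x' \in X_j} \|x - x'\|^{\alpha},$$

$$\begin{aligned}
\theta_{II} ={}& \frac{1}{|X_I|^2} \sum_{x \in X_I} \sum_{x' \in X_I} \|x - x'\|^{\alpha}
 = \frac{1}{|X_I|^2} \sum_{i \in I} \sum_{i' \in I} \sum_{x \in X_i} \sum_{x' \in X_{i'}} \|x - x'\|^{\alpha} \\
={}& \frac{1}{|X_I|^2} \sum_{i \in I} \Bigg( \sum_{x \in X_i} \sum_{x' \in X_i} \|x - x'\|^{\alpha}
 + 2 \sum_{\substack{i' \in I \\ i' > i}} \sum_{x \in X_i} \sum_{x' \in X_{i'}} \|x - x'\|^{\alpha} \Bigg),
\end{aligned}$$

so that in terms of the original constants $\theta_{ij}$ we have

$$\theta_{IJ} = \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \, \theta_{ij},$$

$$\theta_{II} = \frac{1}{|X_I|^2} \sum_{i \in I} \Bigg( |X_i|^2 \theta_{ii}
 + 2 \sum_{\substack{i' \in I \\ i' > i}} |X_i||X_{i'}| \, \theta_{ii'} \Bigg),$$

and analogously for $\theta_{JJ}$.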

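Therefore, the distance between superclusters $X_I$ and $X_J$ is given by

$$\begin{aligned}
D(X_I, X_J) ={}& \frac{|X_I||X_J|}{|X_I| + |X_J|} \left( 2\theta_{IJ} - \theta_{II} - \theta_{JJ} \right) \\
={}& \frac{|X_I||X_J|}{|X_I| + |X_J|} \Bigg[ \frac{2}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \, \theta_{ij}
 - \frac{1}{|X_I|^2} \sum_{i \in I} \Bigg( |X_i|^2 \theta_{ii} + 2 \sum_{\substack{i' \in I \\ i' > i}} |X_i||X_{i'}| \, \theta_{ii'} \Bigg) \\
& - \frac{1}{|X_J|^2} \sum_{j \in J} \Bigg( |X_j|^2 \theta_{jj} + 2 \sum_{\substack{j' \in J \\ j' > j}} |X_j||X_{j'}| \, \theta_{jj'} \Bigg) \Bigg].
\end{aligned}$$

Now simplify

$$\begin{aligned}
\sum_{i \in I} \Bigg( |X_i|^2 \theta_{ii} + 2 \sum_{\substack{i' \in I \\ i' > i}} |X_i||X_{i'}| \, \theta_{ii'} \Bigg)
={}& \sum_{i \in I} \Bigg[ |X_i|^2 \theta_{ii} + \sum_{\substack{i' \in I \\ i' > i}} |X_i||X_{i'}|
 \left( 2\theta_{ii'} - \theta_{ii} - \theta_{i'i'} + \theta_{ii} + \theta_{i'i'} \right) \Bigg] \\
={}& \sum_{i \in I} \Bigg[ |X_i|^2 \theta_{ii} + \sum_{\substack{i' \in I \\ i' > i}} |X_i||X_{i'}| \left( \theta_{ii} + \theta_{i'i'} \right)
 + \sum_{\substack{i' \in I \\ i' > i}} \left( |X_i| + |X_{i'}| \right) D(X_i, X_{i'}) \Bigg] \\
={}& |X_I| \sum_{i \in I} |X_i| \, \theta_{ii}
 + \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} \left( |X_i| + |X_{i'}| \right) D(X_i, X_{i'}),
\end{aligned}$$

where in the last equality we have used the equivalence

$$\sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} |X_i||X_{i'}| \left( \theta_{ii} + \theta_{i'i'} \right)
 = \sum_{i \in I} \sum_{\substack{i' \in I \\ i' \neq i}} |X_i||X_{i'}| \, \theta_{ii}.$$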



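Hence,

$$\begin{aligned}
\left( |X_I| + |X_J| \right) D(X_I, X_J) ={}& 2 \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \, \theta_{ij} \\
& - |X_J| \sum_{i \in I} |X_i| \, \theta_{ii}
 - \frac{|X_J|}{|X_I|} \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} \left( |X_i| + |X_{i'}| \right) D(X_i, X_{i'}) \\
& - |X_I| \sum_{j \in J} |X_j| \, \theta_{jj}
 - \frac{|X_I|}{|X_J|} \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} \left( |X_j| + |X_{j'}| \right) D(X_j, X_{j'})
\end{aligned}$$

or, equivalently,

$$\begin{aligned}
\left( |X_I| + |X_J| \right) D(X_I, X_J) ={}& \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \, 2\theta_{ij}
 - \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \, \theta_{ii}
 - \sum_{i \in I} \sum_{j \in J} |X_i||X_j| \, \theta_{jj} \\
& - \frac{|X_J|}{|X_I|} \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} \left( |X_i| + |X_{i'}| \right) D(X_i, X_{i'})
 - \frac{|X_I|}{|X_J|} \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} \left( |X_j| + |X_{j'}| \right) D(X_j, X_{j'}),
\end{aligned}$$

which is also the same as

$$\begin{aligned}
\left( |X_I| + |X_J| \right) D(X_I, X_J) ={}& \sum_{i \in I} \sum_{j \in J} \left( |X_i| + |X_j| \right) D(X_i, X_j) \\
& - \frac{|X_J|}{|X_I|} \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} \left( |X_i| + |X_{i'}| \right) D(X_i, X_{i'})
 - \frac{|X_I|}{|X_J|} \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} \left( |X_j| + |X_{j'}| \right) D(X_j, X_{j'}).
\end{aligned}$$

And this is exactly the desired variable-group formulation for the joint between-within method.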
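
The final identity of this proof can also be checked numerically. The following sketch rests on several assumptions made only for illustration: the points are random, the function names are hypothetical, and the exponent on the Euclidean distances is fixed to `alpha = 1.0`. It evaluates both sides of the last equation, the left one directly on the superclusters and the right one from the distances between constituent clusters.

```python
import math
import random
from itertools import combinations


def theta(a, b, alpha):
    """Mean of ||x - x'||^alpha over all x in a and x' in b."""
    return sum(math.dist(x, y) ** alpha for x in a for y in b) / (len(a) * len(b))


def jbw_distance(a, b, alpha):
    """Joint between-within distance between clusters a and b."""
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * (2 * theta(a, b, alpha) - theta(a, a, alpha) - theta(b, b, alpha))


def size_weighted_pairs(clusters, alpha):
    """Sum over pairs i < i' of (|X_i| + |X_i'|) D(X_i, X_i')."""
    return sum((len(a) + len(b)) * jbw_distance(a, b, alpha)
               for a, b in combinations(clusters, 2))


def random_cluster(rng, size, dim=2):
    """Hypothetical cluster of random points in the unit square."""
    return [[rng.random() for _ in range(dim)] for _ in range(size)]


rng = random.Random(1)
alpha = 1.0  # assumed exponent for this check
clusters_i = [random_cluster(rng, s) for s in (2, 3, 1)]  # the X_i, i in I
clusters_j = [random_cluster(rng, s) for s in (4, 2)]     # the X_j, j in J
super_i = [p for c in clusters_i for p in c]              # supercluster X_I
super_j = [p for c in clusters_j for p in c]              # supercluster X_J
n_i, n_j = len(super_i), len(super_j)

# Left-hand side: (|X_I| + |X_J|) D(X_I, X_J) computed on the merged point sets.
lhs = (n_i + n_j) * jbw_distance(super_i, super_j, alpha)

# Right-hand side: between-cluster sum minus the two weighted within sums.
rhs = (sum((len(a) + len(b)) * jbw_distance(a, b, alpha)
           for a in clusters_i for b in clusters_j)
       - n_j / n_i * size_weighted_pairs(clusters_i, alpha)
       - n_i / n_j * size_weighted_pairs(clusters_j, alpha))
```

The two values should coincide up to rounding error. Changing `alpha` to another positive value leaves the agreement intact, since the derivation only uses the averages of the powered distances.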

Table 1. Parameter Values for the Lance and Williams' Formula

Table 2. Parameter Values for the Variable-Group Formula

Figure 1. Toy Graph with Four Individuals and Shortest Path Distances

Figure 2. Unweighted Average Dendrograms for the Toy Example

Figure 3. Unweighted Average Multidendrogram for the Toy Example

Figure 4. Simultaneous Occurrence of Different Superclusters

Figure 5. First Complete Linkage Dendrogram for the Soils Data

Figure 6. Second Complete Linkage Dendrogram for the Soils Data

Figure 7. Complete Linkage Multidendrogram for the Soils Data

Figure 8. Complete Linkage Multidendrogram for the Soils Data with an Accuracy of Two Decimal Places